update size
Meta-LearningtheSearchDistributionofBlack-Box RandomSearchBasedAdversarialAttacks
A very promising direction in the field of black-box adversarial attacks are randomized search schemes for crafting adversarial examples [1, 23, 24]. Combining random search with specific update proposal distributions allows to achieve state-of-the-art black-box efficiency for different threat models such as` and `2 [1], `1 [25], `0, adversarial patches, and adversarial frames [24].
- Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)
- Asia > Middle East > Jordan (0.04)
- Transportation (0.56)
- Information Technology (0.36)
- North America > United States (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Europe > Switzerland (0.04)
- (2 more...)
Analyzing & Reducing the Need for Learning Rate Warmup in GPT Training
Learning Rate Warmup is a popular heuristic for training neural networks, especially at larger batch sizes, despite limited understanding of its benefits. Warmup decreases the update size $\Delta \mathbf{w}_t = \eta_t \mathbf{u}_t$ early in training by using lower values for the learning rate $\eta_t$. In this work we argue that warmup benefits training by keeping the overall size of $\Delta \mathbf{w}_t$ limited, counteracting large initial values of $\mathbf{u}_t$. Focusing on small-scale GPT training with AdamW/Lion, we explore the following question: *Why and by which criteria are early updates $\mathbf{u}_t$ too large?* We analyze different metrics for the update size including the $\ell_2$-norm, resulting directional change, and impact on the representations of the network, providing a new perspective on warmup. In particular, we find that warmup helps counteract large angular updates as well as a limited critical batch size early in training. Finally, we show that the need for warmup can be significantly reduced or eliminated by modifying the optimizer to explicitly normalize $\mathbf{u}_t$ based on the aforementioned metrics.
- North America > United States (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Europe > Switzerland (0.04)
- (2 more...)
- Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.14)
- North America > Canada > Ontario > Toronto (0.14)
- Asia > Middle East > Jordan (0.04)
- Information Technology > Security & Privacy (0.87)
- Transportation > Air (0.65)
- Government > Military (0.64)
Analyzing & Reducing the Need for Learning Rate Warmup in GPT Training
Kosson, Atli, Messmer, Bettina, Jaggi, Martin
Learning Rate Warmup is a popular heuristic for training neural networks, especially at larger batch sizes, despite limited understanding of its benefits. Warmup decreases the update size $\Delta \mathbf{w}_t = \eta_t \mathbf{u}_t$ early in training by using lower values for the learning rate $\eta_t$. In this work we argue that warmup benefits training by keeping the overall size of $\Delta \mathbf{w}_t$ limited, counteracting large initial values of $\mathbf{u}_t$. Focusing on small-scale GPT training with AdamW/Lion, we explore the following question: Why and by which criteria are early updates $\mathbf{u}_t$ too large? We analyze different metrics for the update size including the $\ell_2$-norm, resulting directional change, and impact on the representations of the network, providing a new perspective on warmup. In particular, we find that warmup helps counteract large angular updates as well as a limited critical batch size early in training. Finally, we show that the need for warmup can be significantly reduced or eliminated by modifying the optimizer to explicitly normalize $\mathbf{u}_t$ based on the aforementioned metrics.
- North America > United States (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Europe > Switzerland (0.04)
- (2 more...)
SLoRA: Federated Parameter Efficient Fine-Tuning of Language Models
Babakniya, Sara, Elkordy, Ahmed Roushdy, Ezzeldin, Yahya H., Liu, Qingfeng, Song, Kee-Bong, El-Khamy, Mostafa, Avestimehr, Salman
Transfer learning via fine-tuning pre-trained transformer models has gained significant success in delivering state-of-the-art results across various NLP tasks. In the absence of centralized data, Federated Learning (FL) can benefit from distributed and private data of the FL edge clients for fine-tuning. However, due to the limited communication, computation, and storage capabilities of edge devices and the huge sizes of popular transformer models, efficient fine-tuning is crucial to make federated training feasible. This work explores the opportunities and challenges associated with applying parameter efficient fine-tuning (PEFT) methods in different FL settings for language tasks. Specifically, our investigation reveals that as the data across users becomes more diverse, the gap between fully fine-tuning the model and employing PEFT methods widens. To bridge this performance gap, we propose a method called SLoRA, which overcomes the key limitations of LoRA in high heterogeneous data scenarios through a novel data-driven initialization technique. Our experimental results demonstrate that SLoRA achieves performance comparable to full fine-tuning, with significant sparse updates with approximately $\sim 1\%$ density while reducing training time by up to $90\%$.
- North America > United States > California (0.14)
- North America > United States > Virginia (0.04)
- Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
Meta-Learning the Search Distribution of Black-Box Random Search Based Adversarial Attacks
Yatsura, Maksym, Metzen, Jan Hendrik, Hein, Matthias
Adversarial attacks based on randomized search schemes have obtained state-of-the-art results in black-box robustness evaluation recently. However, as we demonstrate in this work, their efficiency in different query budget regimes depends on manual design and heuristic tuning of the underlying proposal distributions. We study how this issue can be addressed by adapting the proposal distribution online based on the information obtained during the attack. We consider Square Attack, which is a state-of-the-art score-based black-box attack, and demonstrate how its performance can be improved by a learned controller that adjusts the parameters of the proposal distribution online during the attack. We train the controller using gradient-based end-to-end training on a CIFAR10 model with white box access. We demonstrate that plugging the learned controller into the attack consistently improves its black-box robustness estimate in different query regimes by up to 20% for a wide range of different models with black-box access. We further show that the learned adaptation principle transfers well to the other data distributions such as CIFAR100 or ImageNet and to the targeted attack setting.
- Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.14)
- North America > Canada > Ontario > Toronto (0.14)
- Asia > Middle East > Jordan (0.04)
- Transportation > Air (1.00)
- Information Technology > Security & Privacy (0.86)